Online Planning with Lookahead Policies

Neural Information Processing Systems

Real Time Dynamic Programming (RTDP) is an online algorithm based on Dynamic Programming (DP) that acts by 1-step greedy planning. Unlike DP, RTDP does not require access to the entire state space, i.e., it handles exploration explicitly. This makes RTDP particularly appealing when the state space is large and it is not possible to update all states simultaneously. In this work, we devise a multi-step greedy RTDP algorithm, which we call $h$-RTDP, that replaces the 1-step greedy policy with an $h$-step lookahead policy. We analyze $h$-RTDP in its exact form and establish that increasing the lookahead horizon, $h$, results in improved sample complexity, at the cost of additional computation. This is the first work that proves improved sample complexity as a result of {\em increasing} the lookahead horizon in online planning. We then analyze the performance of $h$-RTDP in three approximate settings: approximate model, approximate value updates, and approximate state representation. For these cases, we prove that the asymptotic performance of $h$-RTDP remains the same as that of a corresponding approximate DP algorithm, the best one can hope for without further assumptions on the approximation errors.
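To make the core idea concrete, here is a minimal sketch of $h$-step lookahead action selection over a tabular MDP. This is an illustration of the lookahead-policy concept only, not the paper's $h$-RTDP algorithm; the function name, data layout (`P[a][s][s2]` transition probabilities, `R[s][a]` rewards, `V` a value estimate used at the lookahead leaves), and discount factor are assumptions for the example.

```python
def h_step_lookahead_action(P, R, V, s, h, gamma=0.95):
    """Greedy action from state s via h-step lookahead (illustrative sketch).

    P[a][s][s2]: transition probability to s2 when taking action a in s
    R[s][a]:     immediate reward for action a in state s
    V[s]:        value estimate plugged in at the lookahead leaves
    """
    n_states, n_actions = len(V), len(P)

    def lookahead(state, depth):
        # Value of `state` under optimal behavior for `depth` more steps,
        # bootstrapping with V at the leaves (depth == 0).
        if depth == 0:
            return V[state]
        return max(
            R[state][a] + gamma * sum(
                P[a][state][s2] * lookahead(s2, depth - 1)
                for s2 in range(n_states))
            for a in range(n_actions))

    # Pick the action maximizing the h-step lookahead Q-value from s.
    return max(
        range(n_actions),
        key=lambda a: R[s][a] + gamma * sum(
            P[a][s][s2] * lookahead(s2, h - 1) for s2 in range(n_states)))
```

With $h=1$ this reduces to the standard 1-step greedy policy with respect to $V$; larger $h$ expands the full depth-$h$ search tree, which is the source of the additional computation the abstract refers to.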



The Value of Reward Lookahead in Reinforcement Learning

Neural Information Processing Systems

In reinforcement learning (RL), agents sequentially interact with changing environments while aiming to maximize the obtained rewards.




Review for NeurIPS paper: Online Planning with Lookahead Policies

Neural Information Processing Systems

Additional Feedback: COMMENTS AFTER REBUTTAL

Thank you for your response. However, in this paper's case I find that the significance of the paper (i.e., support for your claim that "theoretical results provided in this work are important on their own") is severely lacking without experiments showing a link between this theory and an algorithm's performance in terms of measures like running time, number of 1-step Bellman backups, etc.

***Note: this is not a claim that every theoretical paper needs experiments; it applies only to this specific work, due to the theory issues mentioned in the original review.***

The rebuttal's attempted arguments against providing experiments really miss the mark:

-- The rebuttal gives "Beyond the one step greedy approach in RL" as an example of a paper similar in the degree of its theoretical focus to this submission, but that paper actually has experiments! Light experiments could do the job; the "Beyond the one step greedy approach in RL" paper that you mentioned yourself is a case in point.

